• How do you add new features in a scikit-learn pipeline with a ColumnTransformer?

    I came across this pipeline setup where feature engineering is being added before a ColumnTransformer, but the new features don’t seem to flow correctly through the pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class FeatureAdder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self
        
        def transform(self, X):
            X = X.copy()  # avoid mutating the caller's DataFrame
            X['new_feature'] = X['col1'] * X['col2']
            return X
    
    pipeline = Pipeline([
        ('feature_add', FeatureAdder()),
        ('preprocess', ColumnTransformer([
            ('num', StandardScaler(), ['col1', 'col2']),
            ('cat', OneHotEncoder(), ['col3'])
        ]))
    ])
    

    The issue is:

    • The newly created new_feature is not included in the ColumnTransformer

    • This leads to it being dropped during transformation, since the ColumnTransformer keeps only the columns it is told about (its default is remainder='drop')

    In a setup like this:

    • Should the ColumnTransformer be dynamically updated to include new features (one option is sketched after these questions)?

    • Or is it better to handle feature engineering outside the pipeline altogether?

    • How do you ensure feature consistency without breaking pipeline modularity?
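
    For concreteness, a minimal sketch of the first option, reusing the imports and FeatureAdder defined above: because FeatureAdder runs before the ColumnTransformer in the pipeline, the engineered column already exists when columns are selected, so it can simply be listed alongside the original numeric columns.

    # Sketch: the engineered column is listed explicitly in the numeric branch.
    pipeline = Pipeline([
        ('feature_add', FeatureAdder()),
        ('preprocess', ColumnTransformer([
            ('num', StandardScaler(), ['col1', 'col2', 'new_feature']),
            ('cat', OneHotEncoder(), ['col3'])
        ]))
    ])

    If listing columns by hand feels brittle, an alternative is remainder='passthrough' on the ColumnTransformer so that any unlisted columns (including newly created ones) are kept untransformed rather than dropped.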

  • How to handle imbalanced datasets effectively in classification problems?

    I’m working on a classification problem where one class heavily outweighs the others (around 90:10 ratio). My model is achieving high accuracy, but it’s clearly biased toward the majority class.

    Here’s a simplified version:

     
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

     

    Accuracy looks good, but recall and precision for the minority class are poor.

    What I want to understand:

    • What are the best techniques to handle imbalance (SMOTE, class weights, etc.)?
    • When should I prefer resampling vs adjusting model parameters?
    • Which evaluation metrics should I focus on in such cases?

    Would appreciate practical advice based on real-world experience.
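
    For concreteness, a minimal sketch of the class-weight option (assuming the same X and y as above; stratify and class_weight are the only changes):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    # Stratified split keeps the 90:10 class ratio in both train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # class_weight='balanced' up-weights minority-class errors during training.
    model = RandomForestClassifier(class_weight='balanced', random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    # Per-class precision, recall, and F1 matter more than overall accuracy here.
    print(classification_report(y_test, y_pred))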

     
     
  • How do you handle model performance degradation after deployment?

    Many models perform well during training and validation but start degrading in production due to data drift, concept drift, or changing user behavior. What monitoring strategies, retraining pipelines, or evaluation practices do you use to maintain model performance in production environments?

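    To make the monitoring part concrete, here is a minimal sketch of one common drift check: a population stability index (PSI) comparison between a training-time snapshot of a feature and a recent production sample. The column names and the 0.2 threshold mentioned below are placeholders, not a standard.

    import numpy as np

    def population_stability_index(expected, actual, bins=10):
        """Rough PSI between a reference (training) sample and a recent production sample."""
        # Bin edges come from the reference distribution.
        edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
        expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        # Clip to avoid log(0) for empty bins.
        expected_pct = np.clip(expected_pct, 1e-6, None)
        actual_pct = np.clip(actual_pct, 1e-6, None)
        return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

    # Hypothetical usage: values above roughly 0.2 are often treated as meaningful drift.
    # psi = population_stability_index(train_df["feature"], recent_df["feature"])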

  • Why does model performance drop when using time-based train-test splits?

    I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance once I move from training to validation. I’m sharing a simplified version of the problem and code below.

    The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

    Here’s a simplified version of the code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    
    # sample data
    df = pd.read_csv("data.csv")
    df = df.sort_values("event_time")
    
    X = df.drop(columns=["target"])
    y = df["target"]
    
    # time-based split
    split_index = int(len(df) * 0.8)
    X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
    y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]
    
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    
    preds = model.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, preds))
    

    With a random split, the AUC was around 0.82.
    With the time-based split, it drops to around 0.61.

    I’m trying to understand:

    • Is this performance gap a common sign of data leakage in the original setup?

    • Are tree-based models like Random Forests particularly sensitive to temporal shifts?

    • What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem?

    • Would you approach validation differently for time-dependent data like this?

    Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.
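
    For concreteness, a minimal sketch of a rolling time-based validation with TimeSeriesSplit (same time-sorted X and y as above; the fold count is arbitrary):

    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # Each fold trains on an earlier window and evaluates on the window that follows it.
    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        model = RandomForestClassifier(random_state=42)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict_proba(X.iloc[test_idx])[:, 1]
        # Assumes both classes appear in every fold; otherwise roc_auc_score raises an error.
        print(f"fold {fold}: AUC = {roc_auc_score(y.iloc[test_idx], preds):.3f}")

    Fold-to-fold variation in AUC can help separate a genuinely harder recent period (scores degrading over time) from a one-off issue with a single holdout window.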

     

     

  • Future of Data Science Moving Away From Modeling and Toward Problem Framing?

    Data science as a discipline is shifting faster than most people realize. A decade ago, the core skill set revolved around building models, tuning hyperparameters, crafting feature pipelines, and selecting algorithms. But with the rise of AutoML, pretrained foundation models, vector databases, and agentic AI systems, much of the “technical heavy lifting” is becoming automated or abstracted away.

    Today, the competitive advantage is less about who can write the best model from scratch and more about who can frame the right problem, define meaningful metrics, interpret model outputs responsibly, design data loops, and understand the business impact of predictions. Even the most complex models (LLMs, multimodal architectures, time-series forecasters) can now be deployed with pre-built frameworks or API calls.

    This shift raises an important question about the future of the field:
    If modeling becomes commoditized, does the true value of a data scientist lie in strategic thinking rather than technical implementation?
